battle_of_bastards

Game of Thrones - Battles Analysis and Predicting Victory

Game of Thrones is a popular fantasy TV show based on a series of books written by George R. R. Martin. This notebook showcases the analysis and predictions of the battles in the series.

As a fan of the hit TV show, I have been amazed by the battle scenes in every season. One of my favorite battles is the "Battle of the Bastards" in Season 6, Episode 9. What I like about this battle is that Jon Snow managed to convince the Free Folk to join his army and fight against House Bolton. This episode was well directed in terms of the number of people involved in sequencing the battle, and there were many detailed attacks in the episode.

My notebook analyzes Chris Albon’s “The War of the Five Kings” Dataset, which can be found here. It’s a great collection of all of the battles in the series.

Key Questions

I plan on tackling three key questions from the battles dataset:

1) Which house wins the most battles in any situation?
2) What is the expected size of the defending army given the size of the attacking army?
3) What factors contribute to a battle victory?

Load packages

In [1]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import clear_output
import seaborn as sns
from pylab import rcParams
from collections import Counter
from time import time
from pandas_profiling import ProfileReport
from IPython.display import display
import statsmodels.api as sm
import plotly.express as px
# Import supplementary visualization code visuals.py
import visuals as vs
rcParams['figure.figsize'] = 12, 9
plt.style.use('ggplot')
In [2]:
# @hidden_cell
import warnings
warnings.filterwarnings("ignore")

Load dataset

In [3]:
battles_df = pd.read_csv('../data/battles.csv')
battles_df.head()
Out[3]:
name year battle_number attacker_king defender_king attacker_1 attacker_2 attacker_3 attacker_4 defender_1 ... major_death major_capture attacker_size defender_size attacker_commander defender_commander summer location region note
0 Battle of Winterfell 299 12 Balon/Euron Greyjoy Robb Stark Greyjoy NaN NaN NaN Stark ... 0.0 1.0 20.0 NaN Theon Greyjoy Bran Stark 1.0 Winterfell The North It isn't mentioned how many Stark men are left...
1 Sack of Harrenhal 299 18 Robb Stark Joffrey/Tommen Baratheon Stark NaN NaN NaN Lannister ... 1.0 0.0 100.0 100.0 Roose Bolton, Vargo Hoat, Robett Glover Amory Lorch 1.0 Harrenhal The Riverlands NaN
2 Battle of Torrhen's Square 299 11 Robb Stark Balon/Euron Greyjoy Stark NaN NaN NaN Greyjoy ... 0.0 0.0 244.0 900.0 Rodrik Cassel, Cley Cerwyn Dagmer Cleftjaw 1.0 Torrhen's Square The North Greyjoy's troop number comes from the 264 esti...
3 Battle of the Stony Shore 299 10 Balon/Euron Greyjoy Robb Stark Greyjoy NaN NaN NaN Stark ... 0.0 0.0 264.0 NaN Theon Greyjoy NaN 1.0 Stony Shore The North Greyjoy's troop number based on the Battle of ...
4 Sack of Winterfell 299 14 Joffrey/Tommen Baratheon Robb Stark Bolton Greyjoy NaN NaN Stark ... 1.0 0.0 618.0 2000.0 Ramsay Snow, Theon Greyjoy Rodrik Cassel, Cley Cerwyn, Leobald Tallhart 1.0 Winterfell The North Since House Bolton betrays the Starks for Hous...

5 rows × 25 columns

While reviewing other kernels to see what has been done, I found one kernel on Kaggle that pointed out a data entry mistake in the Battle of Castle Black. Having watched the TV series, I know that Mance Rayder has 100K wildlings and Stannis Baratheon has 1,240 troops, so I flipped the names in the dataset. This should be a major callout to anyone using this dataset.
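The correction described above can be sketched as follows. This is a minimal illustration on a toy frame, not the exact edit applied to the CSV: the row label "Battle of Castle Black", the helper name swap_kings, and the assumption that only the two king columns need to be exchanged are all hypothetical.

```python
import pandas as pd

# Toy frame mimicking the relevant columns (values are illustrative).
toy = pd.DataFrame({
    "name": ["Battle of Castle Black", "Sack of Harrenhal"],
    "attacker_king": ["Stannis Baratheon", "Robb Stark"],
    "defender_king": ["Mance Rayder", "Joffrey/Tommen Baratheon"],
    "attacker_size": [100000.0, 100.0],
    "defender_size": [1240.0, 100.0],
})

def swap_kings(df, battle_name):
    """Return a copy with attacker_king and defender_king exchanged for one battle."""
    out = df.copy()
    mask = out["name"] == battle_name
    # to_numpy() avoids label alignment, so the two columns really trade places.
    out.loc[mask, ["attacker_king", "defender_king"]] = (
        out.loc[mask, ["defender_king", "attacker_king"]].to_numpy()
    )
    return out

fixed = swap_kings(toy, "Battle of Castle Black")
print(fixed[["name", "attacker_king", "defender_king"]])
```

Working on a copy keeps the raw data untouched, which makes the correction easy to audit later.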

The attacker and defender house columns, along with the note column, are dropped here so that the rest of the dataset is easier to read.

In [4]:
battles_df.drop(['attacker_1','attacker_2','attacker_3','attacker_4','defender_1','defender_2','defender_3','defender_4','note'],axis=1).head()
Out[4]:
name year battle_number attacker_king defender_king attacker_outcome battle_type major_death major_capture attacker_size defender_size attacker_commander defender_commander summer location region
0 Battle of Winterfell 299 12 Balon/Euron Greyjoy Robb Stark win ambush 0.0 1.0 20.0 NaN Theon Greyjoy Bran Stark 1.0 Winterfell The North
1 Sack of Harrenhal 299 18 Robb Stark Joffrey/Tommen Baratheon win ambush 1.0 0.0 100.0 100.0 Roose Bolton, Vargo Hoat, Robett Glover Amory Lorch 1.0 Harrenhal The Riverlands
2 Battle of Torrhen's Square 299 11 Robb Stark Balon/Euron Greyjoy win pitched battle 0.0 0.0 244.0 900.0 Rodrik Cassel, Cley Cerwyn Dagmer Cleftjaw 1.0 Torrhen's Square The North
3 Battle of the Stony Shore 299 10 Balon/Euron Greyjoy Robb Stark win ambush 0.0 0.0 264.0 NaN Theon Greyjoy NaN 1.0 Stony Shore The North
4 Sack of Winterfell 299 14 Joffrey/Tommen Baratheon Robb Stark win ambush 1.0 0.0 618.0 2000.0 Ramsay Snow, Theon Greyjoy Rodrik Cassel, Cley Cerwyn, Leobald Tallhart 1.0 Winterfell The North

Data Cleaning

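For this step, I will encode attacker_outcome as a 0/1 flag in a new column, attacker_outcome_flag, and fill NaN with zeros for attacker_outcome_flag, major_death, major_capture, and summer. Note that filling a missing outcome with zero counts that battle as a loss for the attacker.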

In [5]:
battles_df['attacker_outcome_flag'] = battles_df['attacker_outcome'].map({'win': 1, 'loss': 0})
battles_df['attacker_outcome_flag'] = battles_df['attacker_outcome_flag'].fillna(0)
battles_df['major_death'] = battles_df['major_death'].fillna(0)
battles_df['major_capture'] = battles_df['major_capture'].fillna(0)
battles_df['summer'] = battles_df['summer'].fillna(0)

#Check dataset
battles_df[['attacker_outcome','major_death','major_capture','summer']].head()
Out[5]:
  attacker_outcome  major_death  major_capture  summer
0              win          0.0            1.0     1.0
1              win          1.0            0.0     1.0
2              win          0.0            0.0     1.0
3              win          0.0            0.0     1.0
4              win          1.0            0.0     1.0
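To see what the zero fill does to the outcome flag, here is a small sketch on a made-up series (the four values are hypothetical, not rows from the dataset). It compares the win rate when an unknown outcome is counted as a loss with the win rate over known outcomes only.

```python
import numpy as np
import pandas as pd

# Hypothetical outcomes: two wins, one loss, one unknown.
outcome = pd.Series(["win", "loss", np.nan, "win"], name="attacker_outcome")

# Same mapping as in the cleaning step: win -> 1, loss -> 0, unknown stays NaN.
flag = outcome.map({"win": 1, "loss": 0})

# Filling NaN with 0 turns the unknown outcome into a loss.
flag_filled = flag.fillna(0)

win_rate_unknown_as_loss = flag_filled.mean()  # 2 wins out of 4 rows
win_rate_known_only = flag.mean()              # 2 wins out of 3 known rows

print(win_rate_unknown_as_loss, win_rate_known_only)
```

Either convention can be reasonable; the point is only that the choice changes the reported win percentage.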

Run pandas profiler for fast EDA

Whenever I start a project, I like to run the dataset through Pandas Profiler.

In [6]:
profile = ProfileReport(battles_df, title='Game of Thrones Battles - Pandas Profiling Report', style={'full_width':True})
profile
Out[6]:

In [7]:
profile.to_file(output_file="../output/got_battles_data_profile.html")

The columns attacker_1 through attacker_4 and defender_1 through defender_4 can be used to count the number of houses involved in a battle, while attacker_commander and defender_commander can be used to count the commanders on each side.

Five columns will be created for:

  • Number of attacking houses
  • Number of defending houses
  • Number of attacker_commander
  • Number of defender_commander
  • Battle Size
In [8]:
battles_df['attack_houses'] = battles_df[['attacker_1','attacker_2','attacker_3','attacker_4']].notnull().sum(axis=1)
battles_df['attack_houses'] = pd.to_numeric(battles_df.attack_houses)

battles_df['defender_houses'] = battles_df[['defender_1','defender_2','defender_3','defender_4']].notnull().sum(axis=1)
battles_df['defender_houses'] = pd.to_numeric(battles_df.defender_houses)

# Check data
battles_df[['attacker_1','attacker_2','attacker_3','attacker_4','attack_houses','defender_1','defender_2','defender_3','defender_4','defender_houses']].sort_values(by=['attack_houses','defender_houses'],ascending=[False,False]).head()
Out[8]:
attacker_1 attacker_2 attacker_3 attacker_4 attack_houses defender_1 defender_2 defender_3 defender_4 defender_houses
14 Baratheon Karstark Mormont Glover 4 Bolton Frey NaN NaN 2
12 Baratheon Karstark Mormont Glover 4 Greyjoy NaN NaN NaN 1
23 Free folk Thenns Giants NaN 3 Night's Watch Baratheon NaN NaN 2
4 Bolton Greyjoy NaN NaN 2 Stark NaN NaN NaN 1
6 Bracken Lannister NaN NaN 2 Blackwood NaN NaN NaN 1

Count occurrences of attacker_commander and defender_commander

In [9]:
battles_df['attacker_commander'].str.split(',', expand=True).head()
Out[9]:
0 1 2 3 4 5
0 Theon Greyjoy None None None None None
1 Roose Bolton Vargo Hoat Robett Glover None None None
2 Rodrik Cassel Cley Cerwyn None None None None
3 Theon Greyjoy None None None None None
4 Ramsay Snow Theon Greyjoy None None None None
In [10]:
battles_df['attacker_commander_count'] = battles_df['attacker_commander'].str.split(',', expand=True).notnull().sum(axis=1)
battles_df[['attacker_commander','attacker_commander_count']].head()
Out[10]:
attacker_commander attacker_commander_count
0 Theon Greyjoy 1
1 Roose Bolton, Vargo Hoat, Robett Glover 3
2 Rodrik Cassel, Cley Cerwyn 2
3 Theon Greyjoy 1
4 Ramsay Snow, Theon Greyjoy 2
In [11]:
battles_df['defender_commander'].str.split(',', expand=True).head()
Out[11]:
0 1 2 3 4 5 6
0 Bran Stark None None None None None None
1 Amory Lorch None None None None None None
2 Dagmer Cleftjaw None None None None None None
3 NaN NaN NaN NaN NaN NaN NaN
4 Rodrik Cassel Cley Cerwyn Leobald Tallhart None None None None
In [12]:
battles_df['defender_commander_count'] = battles_df['defender_commander'].str.split(',', expand=True).notnull().sum(axis=1)
battles_df[['defender_commander','defender_commander_count']].head()
Out[12]:
defender_commander defender_commander_count
0 Bran Stark 1
1 Amory Lorch 1
2 Dagmer Cleftjaw 1
3 NaN 0
4 Rodrik Cassel, Cley Cerwyn, Leobald Tallhart 3
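As a cross-check on the commander counts, the sketch below compares the split-and-count approach used above with an alternative that counts commas. The three sample strings are taken from the outputs above plus one missing value; the variable names are my own.

```python
import pandas as pd

commanders = pd.Series([
    "Theon Greyjoy",
    "Roose Bolton, Vargo Hoat, Robett Glover",
    None,  # missing commander, as in the Battle of the Stony Shore row
])

# Approach used in the notebook: split into columns, count non-null cells.
via_split = commanders.str.split(",", expand=True).notnull().sum(axis=1)

# Alternative: number of commas plus one, with missing rows counted as zero.
via_count = (commanders.str.count(",") + 1).fillna(0).astype(int)

print(pd.DataFrame({"via_split": via_split, "via_count": via_count}))
```

Both approaches assume that a comma always separates two commanders and never appears inside a name.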

Drop columns with missing data

In [13]:
battles_df = battles_df.drop(columns = ['battle_number','attacker_2','attacker_3','attacker_4','defender_2','defender_3','defender_4','note'])
battles_df.head()
Out[13]:
name year attacker_king defender_king attacker_1 defender_1 attacker_outcome battle_type major_death major_capture ... attacker_commander defender_commander summer location region attacker_outcome_flag attack_houses defender_houses attacker_commander_count defender_commander_count
0 Battle of Winterfell 299 Balon/Euron Greyjoy Robb Stark Greyjoy Stark win ambush 0.0 1.0 ... Theon Greyjoy Bran Stark 1.0 Winterfell The North 1.0 1 1 1 1
1 Sack of Harrenhal 299 Robb Stark Joffrey/Tommen Baratheon Stark Lannister win ambush 1.0 0.0 ... Roose Bolton, Vargo Hoat, Robett Glover Amory Lorch 1.0 Harrenhal The Riverlands 1.0 1 1 3 1
2 Battle of Torrhen's Square 299 Robb Stark Balon/Euron Greyjoy Stark Greyjoy win pitched battle 0.0 0.0 ... Rodrik Cassel, Cley Cerwyn Dagmer Cleftjaw 1.0 Torrhen's Square The North 1.0 1 1 2 1
3 Battle of the Stony Shore 299 Balon/Euron Greyjoy Robb Stark Greyjoy Stark win ambush 0.0 0.0 ... Theon Greyjoy NaN 1.0 Stony Shore The North 1.0 1 1 1 0
4 Sack of Winterfell 299 Joffrey/Tommen Baratheon Robb Stark Bolton Stark win ambush 1.0 0.0 ... Ramsay Snow, Theon Greyjoy Rodrik Cassel, Cley Cerwyn, Leobald Tallhart 1.0 Winterfell The North 1.0 2 1 2 3

5 rows × 22 columns

Create battle_size for the total number of people involved in a battle.

In [14]:
battles_df['battle_size'] = battles_df['attacker_size'] + battles_df['defender_size']
battles_df[['attacker_size','defender_size','battle_size']].head()
Out[14]:
   attacker_size  defender_size  battle_size
0           20.0            NaN          NaN
1          100.0          100.0        200.0
2          244.0          900.0       1144.0
3          264.0            NaN          NaN
4          618.0         2000.0       2618.0
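The NaN values in battle_size come from how addition treats missing data. The following sketch, on three toy rows, contrasts the plain sum with two alternatives: filling missing sizes with zero first, and summing whatever is known while keeping NaN only when both sizes are missing. Which one is appropriate is a judgment call; this is an illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

sizes = pd.DataFrame({
    "attacker_size": [20.0, 100.0, np.nan],
    "defender_size": [np.nan, 100.0, np.nan],
})

# Plain addition: any missing operand makes the result NaN.
plain_sum = sizes["attacker_size"] + sizes["defender_size"]

# Fill with zero first: a missing army is treated as an army of size 0.
filled_sum = sizes["attacker_size"].fillna(0) + sizes["defender_size"].fillna(0)

# Sum the known values; stay NaN only if both sizes are missing.
known_sum = sizes.sum(axis=1, min_count=1)

print(pd.DataFrame({"plain": plain_sum, "filled": filled_sum, "known": known_sum}))
```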

Plot correlation

In [15]:
corr_plot = battles_df.corr(method='pearson').style.set_caption('Correlation for Game of Thrones Battles').background_gradient(cmap='coolwarm').set_precision(4)
corr_plot
Out[15]:
Correlation for Game of Thrones Battles
year major_death major_capture attacker_size defender_size summer attacker_outcome_flag attack_houses defender_houses attacker_commander_count defender_commander_count battle_size
year 1 -0.3563 -0.1841 0.1558 -0.366 -0.8105 -0.03909 0.3188 0.1238 -0.005841 -0.2217 0.2017
major_death -0.3563 1 0.2736 0.2726 0.06158 0.3706 -0.2962 0.06181 0.1305 0.3308 0.5853 0.1881
major_capture -0.1841 0.2736 1 0.3355 0.234 0.184 -0.201 0.1234 0.1474 0.3088 0.3435 0.4066
attacker_size 0.1558 0.2726 0.3355 1 -0.1121 -0.2731 -0.5202 0.2364 0.6462 0.5371 0.4528 0.9652
defender_size -0.366 0.06158 0.234 -0.1121 1 0.3255 -0.2831 -0.199 -0.1024 0.258 0.5157 0.1515
summer -0.8105 0.3706 0.184 -0.2731 0.3255 1 0.01634 -0.3823 -0.1385 0.02564 0.1982 -0.38
attacker_outcome_flag -0.03909 -0.2962 -0.201 -0.5202 -0.2831 0.01634 1 -0.2437 -0.4752 -0.6564 -0.5795 -0.6039
attack_houses 0.3188 0.06181 0.1234 0.2364 -0.199 -0.3823 -0.2437 1 0.5559 0.1262 0.09444 0.08775
defender_houses 0.1238 0.1305 0.1474 0.6462 -0.1024 -0.1385 -0.4752 0.5559 1 0.1988 0.2179 0.5808
attacker_commander_count -0.005841 0.3308 0.3088 0.5371 0.258 0.02564 -0.6564 0.1262 0.1988 1 0.7026 0.5468
defender_commander_count -0.2217 0.5853 0.3435 0.4528 0.5157 0.1982 -0.5795 0.09444 0.2179 0.7026 1 0.5028
battle_size 0.2017 0.1881 0.4066 0.9652 0.1515 -0.38 -0.6039 0.08775 0.5808 0.5468 0.5028 1
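For readers who want to see what a single cell of this matrix represents, the sketch below computes a Pearson correlation by hand and compares it with pandas. The two short series are made-up values, not the dataset; the only point is that pandas drops rows where either value is missing before applying the usual formula.

```python
import numpy as np
import pandas as pd

# Toy data: army sizes and outcome flags (hypothetical values).
x = pd.Series([20.0, 100.0, 244.0, 618.0, np.nan])
y = pd.Series([1.0, 1.0, 1.0, 0.0, 1.0])

# Keep only rows where both values are present (pairwise complete observations).
mask = x.notna() & y.notna()
xc = x[mask] - x[mask].mean()
yc = y[mask] - y[mask].mean()

# Pearson r = covariance / (std_x * std_y), written with centered sums.
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
r_pandas = x.corr(y)  # method='pearson' is the default

print(r_manual, r_pandas)
```

Because missing pairs are dropped per column pair, different cells of the matrix can be based on different numbers of battles.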

Write function to clean battles dataset

In [16]:
def clean_battle_data(df):
    df['attacker_outcome_flag'] = df['attacker_outcome'].map({'win': 1, 'loss': 0})

    # Fill NaN with zero
    df['attacker_outcome_flag'] = df['attacker_outcome_flag'].fillna(0)
    df['major_death'] = df['major_death'].fillna(0)
    df['major_capture'] = df['major_capture'].fillna(0)
    df['summer'] = df['summer'].fillna(0)
    df['attacker_size'] = df['attacker_size'].fillna(0)
    df['defender_size'] = df['defender_size'].fillna(0)

    # The columns attacker_2,attacker_3,attacker_4,defender_2,defender_3,defender_4, attacker_commander, defender_commander can be used to count the number of houses involved in the battle.
    df['attack_houses'] = df[['attacker_1','attacker_2','attacker_3','attacker_4']].notnull().sum(axis=1)
    df['attack_houses'] = pd.to_numeric(df.attack_houses)

    df['defender_houses'] = df[['defender_1','defender_2','defender_3','defender_4']].notnull().sum(axis=1)
    df['defender_houses'] = pd.to_numeric(df.defender_houses)

    # Count attacker_commander
    df['attacker_commander_count'] = df['attacker_commander'].str.split(',', expand=True).notnull().sum(axis=1)

    # Count defender_commander
    df['defender_commander_count'] = df['defender_commander'].str.split(',', expand=True).notnull().sum(axis=1)

    # Drop columns with missing data
    df = df.drop(columns = ['battle_number','attacker_2','attacker_3','attacker_4','defender_2','defender_3','defender_4','note'])

    # Create battle_size columns
    df['battle_size'] = df['attacker_size'] + df['defender_size']
    df['battle_size'] = df['battle_size'].fillna(0)

    return df

Test function

In [17]:
battles_df = pd.read_csv('../data/battles.csv')
battles_df = clean_battle_data(battles_df)
battles_df.head()
Out[17]:
name year attacker_king defender_king attacker_1 defender_1 attacker_outcome battle_type major_death major_capture ... defender_commander summer location region attacker_outcome_flag attack_houses defender_houses attacker_commander_count defender_commander_count battle_size
0 Battle of Winterfell 299 Balon/Euron Greyjoy Robb Stark Greyjoy Stark win ambush 0.0 1.0 ... Bran Stark 1.0 Winterfell The North 1.0 1 1 1 1 20.0
1 Sack of Harrenhal 299 Robb Stark Joffrey/Tommen Baratheon Stark Lannister win ambush 1.0 0.0 ... Amory Lorch 1.0 Harrenhal The Riverlands 1.0 1 1 3 1 200.0
2 Battle of Torrhen's Square 299 Robb Stark Balon/Euron Greyjoy Stark Greyjoy win pitched battle 0.0 0.0 ... Dagmer Cleftjaw 1.0 Torrhen's Square The North 1.0 1 1 2 1 1144.0
3 Battle of the Stony Shore 299 Balon/Euron Greyjoy Robb Stark Greyjoy Stark win ambush 0.0 0.0 ... NaN 1.0 Stony Shore The North 1.0 1 1 1 0 264.0
4 Sack of Winterfell 299 Joffrey/Tommen Baratheon Robb Stark Bolton Stark win ambush 1.0 0.0 ... Rodrik Cassel, Cley Cerwyn, Leobald Tallhart 1.0 Winterfell The North 1.0 2 1 2 3 2618.0

5 rows × 23 columns

Export clean data

In [18]:
battles_df.to_csv('../data/battles_clean.csv', index = False)

Exploratory Data Analysis

In [20]:
profile = ProfileReport(battles_df, title='Game of Thrones Battles - Pandas Profiling Report', style={'full_width':True})
profile.to_file(output_file="../output/got_battles_data_profile.html")
profile
Out[20]:

Using the Fast EDA (one dimension):

attacker_outcome
The attacker won 32 out of 38 battles (84.2%).

attacker_king
On the offense, Joffrey/Tommen Baratheon were the attacking kings 36.8% of the time (14 battles), while Mance Rayder was the attacking king only once (2.6%).

defender_king
On the defense, Robb Stark was attacked 36.8% of the time (14 battles), while Joffrey/Tommen Baratheon were second (34.2% or 13 battles). Renly Baratheon defended once (2.6%).

battle_type
We can see that the most common battle_type is pitched battle, appearing 36.8% (14 times) while razing was only 5.3% (twice).

region
Most of the battles were fought in The Riverlands (44.7% or 17 battles) while the second most battles fought were in The North (26.3% or 10 battles). There was only one battle Beyond The Wall (2.6%).

summer
Most of the battles were fought in the summer (26 or 68.4%) while the remaining battles were fought in the winter (12 battles or 31.6%).

year
The majority of the battles were fought in the year 299 (52.6% or 20 battles), followed by the year 300 (28.9% or 11 battles). The remaining year, 298, had only 7 battles (18.4%).
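The one-dimensional percentages above can be reproduced without the profiling report. The sketch below rebuilds the year column from the counts quoted in the text (7, 20, and 11 battles) and derives the shares with value_counts; the reconstructed series is a stand-in for battles_df['year'].

```python
import pandas as pd

# Rebuild the year column from the counts reported above.
years = pd.Series([298] * 7 + [299] * 20 + [300] * 11, name="year")

counts = years.value_counts()                              # battles per year
shares = (years.value_counts(normalize=True) * 100).round(1)  # percentage of all battles

summary = pd.DataFrame({"battles": counts, "pct": shares})
print(summary)
```

The same two calls work for any categorical column, such as region or battle_type.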

Multiple dimensional view

We will examine using multiple variables to see how they play together.

  • attacker_king vs attacker_outcome
  • defender_king vs attacker_outcome
  • battle_type vs attacker_outcome
  • summer vs battle_type vs attacker_outcome
  • battle_type vs attacker_king vs attacker_outcome
  • battle_type vs defender_king vs attacker_outcome
In [21]:
df_grouped = battles_df.groupby(by=['attacker_king']).agg(
    attacker_outcome_flag_count = ('attacker_outcome_flag','count'),
    attacker_outcome_wins = ('attacker_outcome_flag','sum'),
    attacker_size_mean = ('attacker_size', 'mean'),
    attacker_size_std = ('attacker_size', 'std'),
    defender_size_mean = ('defender_size','mean'),
    defender_size_std = ('defender_size','std')).reset_index().sort_values(by = 'attacker_outcome_flag_count', ascending = False)

df_grouped['attacker_outcome_loss'] = df_grouped['attacker_outcome_flag_count'] - df_grouped['attacker_outcome_wins']
df_grouped['attacker_outcome_wins_pct'] = (df_grouped['attacker_outcome_wins']/df_grouped['attacker_outcome_flag_count']) * 100
df_grouped['attacker_outcome_loss_pct'] = 100 - df_grouped['attacker_outcome_wins_pct']
df_grouped
Out[21]:
attacker_king attacker_outcome_flag_count attacker_outcome_wins attacker_size_mean attacker_size_std defender_size_mean defender_size_std attacker_outcome_loss attacker_outcome_wins_pct attacker_outcome_loss_pct
1 Joffrey/Tommen Baratheon 14 13.0 4329.857143 6880.377023 2558.571429 3686.319526 1.0 92.857143 7.142857
3 Robb Stark 10 8.0 4121.900000 5524.536832 4962.500000 7070.647955 2.0 80.000000 20.000000
0 Balon/Euron Greyjoy 7 7.0 183.428571 372.955251 0.000000 0.000000 0.0 100.000000 0.000000
4 Stannis Baratheon 4 2.0 8875.000000 8086.769029 8862.500000 8214.354813 2.0 50.000000 50.000000
2 Mance Rayder 1 0.0 100000.000000 NaN 1240.000000 NaN 1.0 0.000000 100.000000
In [22]:
# Group data by attacker_king and calculate win, loss, win percentage, loss percentage, attacker size mean, defender size mean

df_grouped[['attacker_king','attacker_outcome_wins','attacker_outcome_loss','attacker_outcome_wins_pct',
            'attacker_outcome_loss_pct','attacker_size_mean','defender_size_mean']].sort_values(
            by='attacker_outcome_wins_pct',ascending=False).round(1).rename(
            columns={"attacker_king": "Attacker King", "attacker_outcome_wins": "Wins", 
                     "attacker_outcome_loss": "Loss", "attacker_outcome_wins_pct":"Win Percentage",
                     "attacker_outcome_loss_pct" : "Loss Percentage",
                     "attacker_size_mean":"Attacker Size Mean",
                     "defender_size_mean":"Defender Size Mean"})
Out[22]:
              Attacker King  Wins  Loss  Win Percentage  Loss Percentage  Attacker Size Mean  Defender Size Mean
0       Balon/Euron Greyjoy   7.0   0.0           100.0              0.0               183.4                 0.0
1  Joffrey/Tommen Baratheon  13.0   1.0            92.9              7.1              4329.9              2558.6
3                Robb Stark   8.0   2.0            80.0             20.0              4121.9              4962.5
4         Stannis Baratheon   2.0   2.0            50.0             50.0              8875.0              8862.5
2              Mance Rayder   0.0   1.0             0.0            100.0            100000.0              1240.0
In [23]:
# Plot barplot by attacker_king with attacker_outcome_wins_pct and attacker_outcome_loss_pct
df_grouped[['attacker_king','attacker_outcome_wins_pct','attacker_outcome_loss_pct']].plot.bar(x='attacker_king')
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x132e45d50>

The Greyjoys have won all seven of their battles as the attacking king with a very small average army size of 183, while Mance Rayder lost his only battle with 100K men attacking a defending army of only 1,240! The Greyjoys' average defender size of 0 comes from missing defender sizes that were filled with zero during cleaning, so it looks like a small yet nimble army can definitely win a battle if they don't have anyone to fight against!

Joffrey/Tommen Baratheon have won the most battles (13 of 14, or 92.9%) with an average attacking army size of 4,330 against an average defending army size of 2,559.

Robb Stark has won 80% of his battles (8 out of 10) with an average attacking army size of 4,122 against an average defending army size of 4,963.

Lastly, Stannis Baratheon won 50% of his battles (2 out of 4) with an average attacking army size of 8,875 against an average defending army size of 8,862. What is interesting is that the attacking and defending army sizes are very similar.
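The win percentages quoted above are simply the mean of the 0/1 outcome flag within each king's battles. As a minimal sketch, the toy frame below is rebuilt from the win and loss counts in the table (it is not the original data) and recovers the same percentages.

```python
import pandas as pd

# (wins, losses) per attacking king, copied from the table above.
record = {
    "Balon/Euron Greyjoy": (7, 0),
    "Joffrey/Tommen Baratheon": (13, 1),
    "Robb Stark": (8, 2),
    "Stannis Baratheon": (2, 2),
    "Mance Rayder": (0, 1),
}

# One row per battle with a 1 for a win and a 0 for a loss.
rows = [
    (king, flag)
    for king, (wins, losses) in record.items()
    for flag in [1] * wins + [0] * losses
]
toy = pd.DataFrame(rows, columns=["attacker_king", "attacker_outcome_flag"])

# Mean of a 0/1 flag is the win rate; scale to a percentage.
win_pct = (
    toy.groupby("attacker_king")["attacker_outcome_flag"].mean().mul(100).round(1)
)
print(win_pct.sort_values(ascending=False))
```

With only one to fourteen battles per king, these percentages are descriptive and should be read with that small sample in mind.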

In [24]:
# Plot barplot by attacker_king with attacker_size_mean and defender_size_mean
df_grouped[['attacker_king','attacker_size_mean','defender_size_mean']].plot.bar(x='attacker_king')
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x12a5562d0>
In [25]:
# Remove Mance Rayder as it is an outlier
# Plot barplot by attacker_king with attacker_size_mean and defender_size_mean
df_grouped[['attacker_king','attacker_size_mean','defender_size_mean']][df_grouped.attacker_king != 'Mance Rayder'].plot.bar(x='attacker_king')
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x12c9578d0>

Battle year

Analyze the data broken down by year.

In [26]:
df_year = battles_df.groupby(by=['year']).agg(
    battles = ('name','count'),
    major_death = ('major_death', 'sum'),
    major_capture = ('major_capture','sum')).reset_index().sort_values(by = 'year', ascending = True)
    
df_year.plot.bar(x='year')
display(df_year)
   year  battles  major_death  major_capture
0   298        7          4.0            3.0
1   299       20          8.0            6.0
2   300       11          1.0            2.0

Battle vs Region

In [27]:
sns.countplot(x='region',hue='attacker_king', data = battles_df)
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x1307eae50>
In [28]:
df_region = battles_df.groupby(by=['region']).agg(
    battles_count = ('name','count'),
    major_death = ('major_death', 'sum'),
    major_capture = ('major_capture','sum')).reset_index().sort_values(by = 'battles_count', ascending = False)
    
df_region.plot.bar(x='region')
display(df_region)
            region  battles_count  major_death  major_capture
4   The Riverlands             17          6.0            6.0
2        The North             10          1.0            2.0
5   The Stormlands              3          1.0            0.0
6  The Westerlands              3          2.0            1.0
1   The Crownlands              2          2.0            1.0
3        The Reach              2          0.0            0.0
0  Beyond the Wall              1          1.0            1.0

Battle_Type vs Kings

Attacker Win/Loss Percentage — Battle Type

In [29]:
# count battles by battle_type
battles_df['battle_type'].value_counts().plot.bar()
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x12c607450>

There are four types of battles:

Pitched battle (14 battles)
Siege (12 battles)
Ambush (10 battles)
Razing (2 battles)

Pitched battles, sieges, and ambushes are the common battle types that the houses face. Razings are rare, and when they do occur, there are no major deaths or major captures.

In [30]:
df_battle_type = battles_df.groupby(by=['battle_type']).agg(
    battles_count = ('name','count'),
    major_death = ('major_death', 'sum'),
    major_capture = ('major_capture','sum')).reset_index().sort_values(by = 'battles_count', ascending = False)
    
df_battle_type.plot.bar(x='battle_type')
display(df_battle_type)
      battle_type  battles_count  major_death  major_capture
1  pitched battle             14          5.0            3.0
3           siege             12          2.0            4.0
0          ambush             10          6.0            4.0
2          razing              2          0.0            0.0

In [31]:
# Count battles by region
battles_df['region'].value_counts().plot.bar()
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x12555d990>
In [32]:
# Count battles by attacker_1
battles_df['attacker_1'].value_counts().plot.bar()
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x123a84650>
In [33]:
# Count battles by defender_1
battles_df['defender_1'].value_counts().plot.bar()
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x12d13d310>
In [34]:
# Count battles by summer
battles_df['summer'].value_counts().plot.bar()
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x12d200f10>
In [35]:
# Count battles by attacker_size
battles_df['attacker_size'].hist(bins=20)
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x12d5659d0>
In [36]:
# Count battles by defender_size
battles_df['defender_size'].hist(bins=20)
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x12c8e76d0>
In [37]:
# Count battles by attack_houses
battles_df['attack_houses'].hist(bins=10)
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x12d0787d0>
In [38]:
# Count battles by defender_houses
battles_df['defender_houses'].hist(bins=10)
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x12a44a590>
In [39]:
# Count battles by attacker_commander_count
battles_df['attacker_commander_count'].hist(bins=10)
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x130952110>
In [40]:
# Count battles by defender_commander_count
battles_df['defender_commander_count'].hist(bins=10)
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x12c6793d0>

battle_type vs attacker_outcome

Aggregate by battle_type and count of attacker_outcome_flag, sum of attacker_outcome_flag, mean of attacker_size and defender_size.

In [41]:
df_battle_type = battles_df.groupby(by=['battle_type']).agg(
    battles_count = ('name','count'),
    attacker_outcome_flag_count = ('attacker_outcome_flag','count'),
    attacker_outcome_wins = ('attacker_outcome_flag','sum'),
    attacker_size_mean = ('attacker_size', 'mean'),
    defender_size_mean = ('defender_size','mean')).reset_index().sort_values(by = 'battles_count', ascending = False)

df_battle_type['attacker_outcome_loss'] = df_battle_type['attacker_outcome_flag_count'] - df_battle_type['attacker_outcome_wins']
df_battle_type['attacker_outcome_wins_pct'] = (df_battle_type['attacker_outcome_wins']/df_battle_type['attacker_outcome_flag_count']) * 100
df_battle_type['attacker_outcome_loss_pct'] = 100 - df_battle_type['attacker_outcome_wins_pct']
df_battle_type
Out[41]:
battle_type battles_count attacker_outcome_flag_count attacker_outcome_wins attacker_size_mean defender_size_mean attacker_outcome_loss attacker_outcome_wins_pct attacker_outcome_loss_pct
1 pitched battle 14 14 10.0 6910.285714 4167.857143 4.0 71.428571 28.571429
3 siege 12 12 10.0 9791.666667 2453.333333 2.0 83.333333 16.666667
0 ambush 10 10 10.0 2437.700000 3434.500000 0.0 100.000000 0.000000
2 razing 2 2 2.0 0.000000 0.000000 0.0 100.000000 0.000000
In [42]:
# Plot attacker_outcome_wins_pct by battle_type
# factorplot was renamed to catplot in seaborn; catplot is also used later in this notebook
sns.catplot(x="battle_type", y="attacker_outcome_wins_pct",
            aspect=0.8,
            kind="bar", data=df_battle_type)
Out[42]:
<seaborn.axisgrid.FacetGrid at 0x12d6306d0>

When kings attack, pitched battles are won 71.4% of the time and sieges are won 83.3% of the time. Ambushes and razings are won 100% of the time by the attacking kings.

Summer vs battle_type vs attacker_outcome

Aggregate by summer, battle_type, attacker_outcome and count of attacker_outcome_flag, sum of attacker_outcome_flag, mean of attacker_size and defender_size.

In [43]:
df_battle_type_summer = battles_df.groupby(by=['summer','battle_type']).agg(
    battles_count = ('name','count'),
    attacker_outcome_flag_count = ('attacker_outcome_flag','count'),
    attacker_outcome_wins = ('attacker_outcome_flag','sum'),
    attacker_size_mean = ('attacker_size', 'mean'),
    defender_size_mean = ('defender_size','mean')).reset_index().sort_values(by = 'battles_count', ascending = False)

df_battle_type_summer['attacker_outcome_loss'] = df_battle_type_summer['attacker_outcome_flag_count'] - df_battle_type_summer['attacker_outcome_wins']
df_battle_type_summer['attacker_outcome_wins_pct'] = (df_battle_type_summer['attacker_outcome_wins']/df_battle_type_summer['attacker_outcome_flag_count']) * 100
df_battle_type_summer['attacker_outcome_loss_pct'] = 100 - df_battle_type_summer['attacker_outcome_wins_pct']
df_battle_type_summer
Out[43]:
summer battle_type battles_count attacker_outcome_flag_count attacker_outcome_wins attacker_size_mean defender_size_mean attacker_outcome_loss attacker_outcome_wins_pct attacker_outcome_loss_pct
4 1.0 pitched battle 11 11 7.0 8385.818182 4740.909091 4.0 63.636364 36.363636
3 1.0 ambush 10 10 10.0 2437.700000 3434.500000 0.0 100.000000 0.000000
2 0.0 siege 7 7 5.0 15928.571429 1348.571429 2.0 71.428571 28.571429
5 1.0 siege 5 5 5.0 1200.000000 4000.000000 0.0 100.000000 0.000000
0 0.0 pitched battle 3 3 3.0 1500.000000 2066.666667 0.0 100.000000 0.000000
1 0.0 razing 2 2 2.0 0.000000 0.000000 0.0 100.000000 0.000000
In [44]:
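# Plot attacker_outcome_wins_pct by battle_type, with one panel per value of summer
# (factorplot was renamed to catplot in seaborn)
sns.catplot(x="battle_type", y="attacker_outcome_wins_pct",
            col="summer", aspect=1,
            kind="bar", data=df_battle_type_summer)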
Out[44]:
<seaborn.axisgrid.FacetGrid at 0x12d94a250>

During the winter, pitched battles and razings are won 100% of the time by the attacking king while during summer, only ambushes and sieges are won 100% of the time. There are no ambushes during the winter nor are there razings in the summer time.

In the winter, sieges are won 71.4% of the time while in the summer, pitched battles are won 63.6% of the time.
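The season comparison is easier to read as a grid. The sketch below re-enters the battle counts and attacker wins from the table above and pivots the win percentage by battle type and season; the column name win_pct is my own. Combinations that never occur, such as ambushes in winter, show up as NaN.

```python
import pandas as pd

# Counts and attacker wins per (summer, battle_type), copied from the table above.
summary = pd.DataFrame({
    "summer": [1.0, 1.0, 0.0, 1.0, 0.0, 0.0],
    "battle_type": ["pitched battle", "ambush", "siege", "siege",
                    "pitched battle", "razing"],
    "battles_count": [11, 10, 7, 5, 3, 2],
    "attacker_outcome_wins": [7, 10, 5, 5, 3, 2],
})

summary["win_pct"] = (
    summary["attacker_outcome_wins"] / summary["battles_count"] * 100
).round(1)

# Rows: battle type; columns: summer flag (0.0 = winter, 1.0 = summer).
grid = summary.pivot(index="battle_type", columns="summer", values="win_pct")
print(grid)
```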

Attacker Win/Loss Percentage — Battle Type

battle_type vs. attacker_king

Aggregate by attacker_king, battle_type, attacker_outcome and count of attacker_outcome_flag, sum of attacker_outcome_flag, mean of attacker_size and defender_size.

In [45]:
df_battle_attacker_king = battles_df.groupby(by=['attacker_king','battle_type']).agg(
    battles_count = ('name','count'),
    attacker_outcome_flag_count = ('attacker_outcome_flag','count'),
    attacker_outcome_wins = ('attacker_outcome_flag','sum'),
    attacker_size_mean = ('attacker_size', 'mean'),
    defender_size_mean = ('defender_size','mean')).reset_index().sort_values(by = 'battles_count', ascending = False)

df_battle_attacker_king['attacker_outcome_loss'] = df_battle_attacker_king['attacker_outcome_flag_count'] - df_battle_attacker_king['attacker_outcome_wins']
df_battle_attacker_king['attacker_outcome_wins_pct'] = (df_battle_attacker_king['attacker_outcome_wins']/df_battle_attacker_king['attacker_outcome_flag_count']) * 100
df_battle_attacker_king['attacker_outcome_loss_pct'] = 100 - df_battle_attacker_king['attacker_outcome_wins_pct']
df_battle_attacker_king.sort_values('attacker_king', ascending = True)
Out[45]:
attacker_king battle_type battles_count attacker_outcome_flag_count attacker_outcome_wins attacker_size_mean defender_size_mean attacker_outcome_loss attacker_outcome_wins_pct attacker_outcome_loss_pct
0 Balon/Euron Greyjoy ambush 2 2 2.0 142.000000 0.000000 0.0 100.000000 0.000000
1 Balon/Euron Greyjoy pitched battle 2 2 2.0 0.000000 0.000000 0.0 100.000000 0.000000
3 Balon/Euron Greyjoy siege 2 2 2.0 500.000000 0.000000 0.0 100.000000 0.000000
2 Balon/Euron Greyjoy razing 1 1 1.0 0.000000 0.000000 0.0 100.000000 0.000000
5 Joffrey/Tommen Baratheon pitched battle 6 6 5.0 8333.333333 5000.000000 1.0 83.333333 16.666667
6 Joffrey/Tommen Baratheon siege 5 5 5.0 1300.000000 40.000000 0.0 100.000000 0.000000
4 Joffrey/Tommen Baratheon ambush 3 3 3.0 1372.666667 1873.333333 0.0 100.000000 0.000000
7 Mance Rayder siege 1 1 0.0 100000.000000 1240.000000 1.0 0.000000 100.000000
8 Robb Stark ambush 5 5 5.0 3995.000000 5745.000000 0.0 100.000000 0.000000
9 Robb Stark pitched battle 3 3 1.0 7081.333333 6966.666667 2.0 33.333333 66.666667
10 Robb Stark siege 2 2 2.0 0.000000 0.000000 0.0 100.000000 0.000000
11 Stannis Baratheon pitched battle 2 2 1.0 12750.000000 3725.000000 1.0 50.000000 50.000000
12 Stannis Baratheon siege 2 2 1.0 5000.000000 14000.000000 1.0 50.000000 50.000000
In [46]:
chart = sns.catplot(x="attacker_king", y="attacker_outcome_wins_pct",
                       col="battle_type", aspect=1,
                       kind="bar", data=df_battle_attacker_king)
chart.set_xticklabels(rotation=45, horizontalalignment='right')
Out[46]:
<seaborn.axisgrid.FacetGrid at 0x1309f6d10>

Breaking the battles down by attacking king and attacker win percentage, we see that the Greyjoys have won every type of battle they attacked in.

The Starks do not do well in pitched battles, losing 2/3 of the time as attackers. However, they win every siege and ambush as attackers.

Meanwhile, Stannis Baratheon won 50% of his pitched battles and sieges.

Lastly, Joffrey/Tommen Baratheon won 83.3% of pitched battles and all of the sieges and ambushes.

Defender Win/Loss Percentage

battle_type vs. defender_king

Group by defender_king and battle_type, then compute the battle count, the count and sum of attacker_outcome_flag, and the mean of attacker_size and defender_size.
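
As a minimal, pandas-free sketch of what this aggregation computes, the snippet below groups a few toy records and derives the win and loss percentages. The king names and records are hypothetical and are not rows from the dataset.

```python
from collections import defaultdict

# Hypothetical toy records: (defender_king, battle_type, attacker_outcome_flag)
battles = [
    ("King A", "siege", 1),
    ("King A", "siege", 0),
    ("King A", "siege", 1),
    ("King B", "ambush", 1),
]

# Collect the outcome flags for every (defender_king, battle_type) pair
groups = defaultdict(list)
for king, battle_type, flag in battles:
    groups[(king, battle_type)].append(flag)

summary = {}
for key, flags in groups.items():
    wins = sum(flags)     # attacker wins, like the 'sum' aggregate
    count = len(flags)    # number of battles, like the 'count' aggregate
    wins_pct = wins / count * 100
    summary[key] = {
        "attacker_outcome_wins_pct": wins_pct,
        "attacker_outcome_loss_pct": 100 - wins_pct,  # share won by the defender
    }

for key, row in summary.items():
    print(key, row)
```

Because the flag is recorded from the attacker's point of view, the attacker loss percentage is the defending king's win percentage.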

In [47]:
df_battle_defender_king = battles_df.groupby(by=['defender_king','battle_type']).agg(
    battles_count = ('name','count'),
    attacker_outcome_flag_count = ('attacker_outcome_flag','count'),
    attacker_outcome_wins = ('attacker_outcome_flag','sum'),
    attacker_size_mean = ('attacker_size', 'mean'),
    defender_size_mean = ('defender_size','mean')).reset_index().sort_values(by = 'battles_count', ascending = False)

df_battle_defender_king['attacker_outcome_loss'] = df_battle_defender_king['attacker_outcome_flag_count'] - df_battle_defender_king['attacker_outcome_wins']
df_battle_defender_king['attacker_outcome_wins_pct'] = (df_battle_defender_king['attacker_outcome_wins']/df_battle_defender_king['attacker_outcome_flag_count']) * 100
df_battle_defender_king['attacker_outcome_loss_pct'] = 100 - df_battle_defender_king['attacker_outcome_wins_pct']
df_battle_defender_king
Out[47]:
defender_king battle_type battles_count attacker_outcome_flag_count attacker_outcome_wins attacker_size_mean defender_size_mean attacker_outcome_loss attacker_outcome_wins_pct attacker_outcome_loss_pct
8 Robb Stark pitched battle 6 6 5.0 8333.333333 5000.000000 1.0 83.333333 16.666667
2 Joffrey/Tommen Baratheon ambush 5 5 5.0 3995.000000 5745.000000 0.0 100.000000 0.000000
7 Robb Stark ambush 5 5 5.0 880.400000 1124.000000 0.0 100.000000 0.000000
3 Joffrey/Tommen Baratheon pitched battle 4 4 1.0 10500.000000 6812.500000 3.0 25.000000 75.000000
5 Joffrey/Tommen Baratheon siege 3 3 2.0 1666.666667 2666.666667 1.0 66.666667 33.333333
9 Robb Stark siege 3 3 3.0 1833.333333 0.000000 0.0 100.000000 0.000000
10 Stannis Baratheon siege 3 3 2.0 34000.000000 480.000000 1.0 66.666667 33.333333
0 Balon/Euron Greyjoy pitched battle 2 2 2.0 2372.000000 550.000000 0.0 100.000000 0.000000
1 Balon/Euron Greyjoy siege 2 2 2.0 0.000000 0.000000 0.0 100.000000 0.000000
4 Joffrey/Tommen Baratheon razing 1 1 1.0 0.000000 0.000000 0.0 100.000000 0.000000
6 Renly Baratheon siege 1 1 1.0 5000.000000 20000.000000 0.0 100.000000 0.000000
In [48]:
chart = sns.catplot(x="defender_king", y="attacker_outcome_loss_pct",
                       col="battle_type", aspect=1,
                       kind="bar", data=df_battle_defender_king)
chart.set_xticklabels(rotation=45, horizontalalignment='right')
Out[48]:
<seaborn.axisgrid.FacetGrid at 0x125563990>

Breaking the battles down by defending king and attacker loss percentage, we see that Joffrey/Tommen Baratheon won 75% of pitched battles as defenders but survived only 1/3 of sieges.

Robb Stark won only 1/6 of pitched battles and lost every ambush and siege as a defender.

Stannis Baratheon also survived 1/3 of sieges; the table records no pitched battles with him as the defending king.

attacker_king vs defender_king

Group by attacker_king and defender_king, then compute the battle count, the count and sum of attacker_outcome_flag, and the mean of attacker_size and defender_size.

In [49]:
df_attack_defend = battles_df.groupby(by=['attacker_king','defender_king']).agg(
    battles_count = ('name','count'),
    attacker_outcome_flag_count = ('attacker_outcome_flag','count'),
    attacker_outcome_wins = ('attacker_outcome_flag','sum'),
    attacker_size_mean = ('attacker_size', 'mean'),
    defender_size_mean = ('defender_size','mean')).reset_index().sort_values(by = 'battles_count', ascending = False)

df_attack_defend['attacker_outcome_loss'] = df_attack_defend['attacker_outcome_flag_count'] - df_attack_defend['attacker_outcome_wins']
df_attack_defend['attacker_outcome_wins_pct'] = (df_attack_defend['attacker_outcome_wins']/df_attack_defend['attacker_outcome_flag_count']) * 100
df_attack_defend['attacker_outcome_loss_pct'] = 100 - df_attack_defend['attacker_outcome_wins_pct']
df_attack_defend
Out[49]:
attacker_king defender_king battles_count attacker_outcome_flag_count attacker_outcome_wins attacker_size_mean defender_size_mean attacker_outcome_loss attacker_outcome_wins_pct attacker_outcome_loss_pct
4 Joffrey/Tommen Baratheon Robb Stark 10 10 9.0 5861.800000 3562.000000 1.0 90.000000 10.000000
8 Robb Stark Joffrey/Tommen Baratheon 9 9 7.0 4552.777778 5413.888889 2.0 77.777778 22.222222
2 Balon/Euron Greyjoy Robb Stark 4 4 4.0 321.000000 0.000000 0.0 100.000000 0.000000
1 Balon/Euron Greyjoy Joffrey/Tommen Baratheon 2 2 2.0 0.000000 0.000000 0.0 100.000000 0.000000
5 Joffrey/Tommen Baratheon Stannis Baratheon 2 2 2.0 1000.000000 100.000000 0.0 100.000000 0.000000
10 Stannis Baratheon Joffrey/Tommen Baratheon 2 2 0.0 13000.000000 7625.000000 2.0 0.000000 100.000000
0 Balon/Euron Greyjoy Balon/Euron Greyjoy 1 1 1.0 0.000000 0.000000 0.0 100.000000 0.000000
3 Joffrey/Tommen Baratheon Balon/Euron Greyjoy 1 1 1.0 0.000000 0.000000 0.0 100.000000 0.000000
6 Mance Rayder Stannis Baratheon 1 1 0.0 100000.000000 1240.000000 1.0 0.000000 100.000000
7 Robb Stark Balon/Euron Greyjoy 1 1 1.0 244.000000 900.000000 0.0 100.000000 0.000000
9 Stannis Baratheon Balon/Euron Greyjoy 1 1 1.0 4500.000000 200.000000 0.0 100.000000 0.000000
11 Stannis Baratheon Renly Baratheon 1 1 1.0 5000.000000 20000.000000 0.0 100.000000 0.000000
In [50]:
# Plot attacker_king and defender_king by battles_count
chart = sns.catplot(x="attacker_king", y="battles_count",
                       col="defender_king", aspect=1,
                       kind="bar", data=df_attack_defend)
chart.set_xticklabels(rotation=45, horizontalalignment='right')
Out[50]:
<seaborn.axisgrid.FacetGrid at 0x13388a050>
In [51]:
# Plot attacker_king vs attacker_outcome_loss_pct split by defender_king
chart = sns.catplot(x="attacker_king", y="attacker_outcome_loss_pct",
                       col="defender_king", aspect=1,
                       kind="bar", data=df_attack_defend)
chart.set_xticklabels(rotation=45, horizontalalignment='right')
Out[51]:
<seaborn.axisgrid.FacetGrid at 0x12fa27550>
In [52]:
# Plot attacker_king vs attacker_outcome_wins_pct split by defender_king
chart = sns.catplot(x="attacker_king", y="attacker_outcome_wins_pct",
                       col="defender_king", aspect=1,
                       kind="bar", data=df_attack_defend)
chart.set_xticklabels(rotation=45, horizontalalignment='right')
Out[52]:
<seaborn.axisgrid.FacetGrid at 0x12e5a7f90>

Battle Size

In [53]:
# Plot attacker_king vs attacker_size and remove Mance Rayder
chart = sns.boxplot(x="attacker_king", y="attacker_size", data=battles_df[battles_df.attacker_king != 'Mance Rayder'], palette="Set1")
chart.set_xticklabels(chart.get_xticklabels(),rotation=30)
Out[53]:
[Text(0, 0, 'Balon/Euron Greyjoy'),
 Text(0, 0, 'Robb Stark'),
 Text(0, 0, 'Joffrey/Tommen Baratheon'),
 Text(0, 0, 'Stannis Baratheon')]
In [54]:
# Plot attacker_king vs defender_size and remove Mance Rayder
chart = sns.boxplot(x="attacker_king", y="defender_size", data=battles_df[battles_df.attacker_king != 'Mance Rayder'], palette="Set1")
chart.set_xticklabels(chart.get_xticklabels(),rotation=30)
Out[54]:
[Text(0, 0, 'Balon/Euron Greyjoy'),
 Text(0, 0, 'Robb Stark'),
 Text(0, 0, 'Joffrey/Tommen Baratheon'),
 Text(0, 0, 'Stannis Baratheon')]
In [55]:
# Plot attacker_king vs attacker_size, filter to attacker_outcome == 'win' and remove Mance Rayder
chart = sns.boxplot(x="attacker_king", y="attacker_size", data=battles_df[(battles_df.attacker_king != 'Mance Rayder') & (battles_df.attacker_outcome == 'win')], palette="Set1")
chart.set_xticklabels(chart.get_xticklabels(),rotation=30)
Out[55]:
[Text(0, 0, 'Balon/Euron Greyjoy'),
 Text(0, 0, 'Robb Stark'),
 Text(0, 0, 'Joffrey/Tommen Baratheon'),
 Text(0, 0, 'Stannis Baratheon')]
In [56]:
# Plot attacker_king vs attacker_size, filter to attacker_outcome == 'loss' and remove Mance Rayder
chart = sns.boxplot(x="attacker_king", y="attacker_size", data=battles_df[(battles_df.attacker_king != 'Mance Rayder') & (battles_df.attacker_outcome == 'loss')], palette="Set1")
chart.set_xticklabels(chart.get_xticklabels(),rotation=30)
Out[56]:
[Text(0, 0, 'Robb Stark'),
 Text(0, 0, 'Joffrey/Tommen Baratheon'),
 Text(0, 0, 'Stannis Baratheon')]

Game of Thrones League

If these battles and kings were formatted like a sports league, the standings would look like the table below.

In [57]:
attack_df = battles_df[['attacker_king','attacker_outcome_flag']].rename(columns={"attacker_king":"king","attacker_outcome_flag":"flag"})
defend_df = battles_df[['defender_king','attacker_outcome_flag']].rename(columns={"defender_king":"king","attacker_outcome_flag":"flag"})

defend_df['flag'] = defend_df['flag'].map({1: 0, 0: 1})

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df_attack_defend = pd.concat([attack_df, defend_df], ignore_index=True)

league_df = df_attack_defend.groupby(by=['king']).agg(
    battle_count = ('king','count'),
    win_count = ('flag','sum')
    ).reset_index().sort_values(by = 'battle_count', ascending = False)

league_df['loss_count'] = league_df['battle_count'] - league_df['win_count']
league_df['win_pct'] = ((league_df['win_count']/league_df['battle_count'])*100)
league_df['loss_pct'] = ((league_df['loss_count']/league_df['battle_count'])*100)

# round values
league_df = league_df.sort_values('win_pct',ascending=False).round(1)
league_df[['win_count','loss_count']] = league_df[['win_count','loss_count']].astype('int64')

display(league_df)
display(league_df[['king','win_pct']].plot.bar(x='king'))
king battle_count win_count loss_count win_pct loss_pct
0 Balon/Euron Greyjoy 11 7 4 63.6 36.4
1 Joffrey/Tommen Baratheon 27 17 10 63.0 37.0
5 Stannis Baratheon 7 3 4 42.9 57.1
4 Robb Stark 24 9 15 37.5 62.5
2 Mance Rayder 1 0 1 0.0 100.0
3 Renly Baratheon 1 0 1 0.0 100.0
<matplotlib.axes._subplots.AxesSubplot at 0x12d9ffd10>

Balon/Euron Greyjoy and Joffrey/Tommen Baratheon are almost neck and neck in winning percentage (63.6% vs 63.0% respectively), while Stannis Baratheon has a win rate of 42.9% and Robb Stark has a win rate of 37.5%.

This makes Balon/Euron Greyjoy the winner in terms of winning percentage.
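
The league logic can be sketched without pandas: every battle counts once for the attacking king and once for the defending king, with the outcome flag flipped for the defender. The records and king names below are hypothetical and only illustrate the bookkeeping.

```python
from collections import Counter

# Hypothetical toy records: (attacker_king, defender_king, attacker_outcome_flag)
battles = [
    ("King A", "King B", 1),
    ("King B", "King A", 0),
    ("King A", "King C", 0),
]

played, won = Counter(), Counter()
for attacker, defender, flag in battles:
    played[attacker] += 1
    played[defender] += 1
    won[attacker] += flag        # the attacker wins when the flag is 1
    won[defender] += 1 - flag    # the defender wins when the flag is 0

# One row per king: (king, battle_count, win_count, win_pct), best win_pct first
league = sorted(
    ((king, played[king], won[king], round(won[king] / played[king] * 100, 1))
     for king in played),
    key=lambda row: row[3],
    reverse=True,
)
for row in league:
    print(row)
```

A battle between two armies of the same king counts twice for that king, once as a win and once as a loss.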

Conclusion

Even though the Greyjoys have a 100% success rate when they attack, I would not declare them a powerhouse: the data records no defending army size for their battles, so I would disqualify them.

Joffrey/Tommen Baratheon wins the most battles against defensive armies!

In [58]:
![horse](../img/horse.jpg)

Predictive Modeling

I will build two types of models to answer the remaining two key questions:

2) What is the expected size of the defending army given the size of the attacking army? → Linear Regression Model

3) What factors contribute to a battle victory? → Classification Model

Linear Regression Model — Predicting Opposing Army Size

For creating a linear regression, the attacker size will be plotted against the defender size. I want to understand the relationship between the two variables.

For this dataset, outliers and armies of size 0 are removed before modeling.
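
To see why this filtering matters, here is a small plain-Python sketch of how a single extreme army size can distort the Pearson correlation. The data points and the 50,000 cutoff are made up for illustration; the notebook itself filters on the attacking king and on sizes greater than 0.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical (attacker_size, defender_size) pairs; the last one is an extreme outlier
pairs = [(1000, 1200), (2000, 1900), (3000, 3100), (4000, 4200), (100000, 1240)]

r_all = pearson(*zip(*pairs))

# Illustrative filter: drop zero-sized armies and attacker sizes above an assumed cutoff
filtered = [(a, d) for a, d in pairs if 0 < a < 50000 and d > 0]
r_filtered = pearson(*zip(*filtered))

print(round(r_all, 3), round(r_filtered, 3))
```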

In [59]:
# Plot attacker_size vs defender_size
sns.regplot(x='attacker_size',y='defender_size',data=battles_df)
display(battles_df[['attacker_size','defender_size']].corr())
attacker_size defender_size
attacker_size 1.00000 0.17306
defender_size 0.17306 1.00000
In [60]:
# Plot attacker_size vs defender_size after removing 'Mance Rayder' and keeping only army sizes greater than 0
battles_df1 = battles_df[(battles_df.attacker_king != 'Mance Rayder') & (battles_df.attacker_size > 0) & (battles_df.defender_size > 0)]

sns.regplot(x='attacker_size',y='defender_size',data=battles_df1)
display(battles_df1[['attacker_size','defender_size']].corr())
attacker_size defender_size
attacker_size 1.000000 0.438731
defender_size 0.438731 1.000000
In [61]:
# Plot attacker_size vs defender_size using Plotly
px.scatter(battles_df1, x='attacker_size', y='defender_size', 
                         hover_name="name",
                         hover_data=["name","year","attacker_outcome","attacker_king", "defender_king"],
                         trendline="ols")

The graph shows a weak linear correlation of 0.439 and a coefficient of determination (R-squared) of 0.192, which means the model explains only 19.2% of the variance in the dependent variable. There are two distinct clusters in the plot, so breaking the data into groups might give a better linear relationship between the army sizes.
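
For an ordinary least squares fit with a single predictor, R-squared equals the square of the Pearson correlation, which is how the two numbers above relate. A quick check of the arithmetic:

```python
r = 0.438731          # Pearson correlation between attacker_size and defender_size
r_squared = r ** 2    # coefficient of determination for a one-predictor OLS fit
print(round(r_squared, 3))
```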

In [62]:
battles_df1 = battles_df[(battles_df.attacker_king != 'Mance Rayder') & (battles_df.attacker_size > 0) & (battles_df.defender_size > 0)]
sns.lmplot(x='attacker_size', y='defender_size', data=battles_df1[['attacker_size','defender_size']],
           robust=True, ci=None, scatter_kws={"s": 80})
display(battles_df1[['attacker_size','defender_size']].corr())
attacker_size defender_size
attacker_size 1.000000 0.438731
defender_size 0.438731 1.000000
In [63]:
battles_win_df1 = battles_df[(battles_df.attacker_king != 'Mance Rayder') & (battles_df.attacker_outcome_flag == 1) & (battles_df.attacker_size > 0) & (battles_df.defender_size > 0) & (battles_df.attacker_size < 14000) & (battles_df.defender_size < 20000)]
sns.lmplot(x='attacker_size', y='defender_size', data=battles_win_df1[['attacker_size','defender_size']],
           robust=True, ci=None, scatter_kws={"s": 80})
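# Note: this shows the correlation for battles_df1, not for the filtered battles_win_df1 plotted above
display(battles_df1[['attacker_size','defender_size']].corr())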
attacker_size defender_size
attacker_size 1.000000 0.438731
defender_size 0.438731 1.000000

Attacker Size vs Defender Size — Split by Battle Type

In [64]:
# Split by battle_type
sns.lmplot(x='attacker_size', y='defender_size', data=battles_df1, col = "battle_type",palette="Set1")
display(battles_df1[['attacker_size','defender_size']].corr())
attacker_size defender_size
attacker_size 1.000000 0.438731
defender_size 0.438731 1.000000
In [65]:
battles_win_df2 = battles_df1[(battles_df1['battle_type'] == 'ambush') | (battles_df1['battle_type'] =='pitched battle')]

battle_type_list = battles_win_df2['battle_type'].unique()
for i in range(len(battle_type_list)):
    print('battle_type = ' +  battle_type_list[i])
    display(battles_win_df2.loc[battles_win_df2['battle_type'] == battle_type_list[i],['attacker_size','defender_size']].corr())
battle_type = ambush
attacker_size defender_size
attacker_size 1.000000 0.916012
defender_size 0.916012 1.000000
battle_type = pitched battle
attacker_size defender_size
attacker_size 1.000000 0.684157
defender_size 0.684157 1.000000
In [66]:
def scatter_plotter(df,x,y):
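    ''' Creates one Plotly scatter plot per battle type, each with an OLS trendline,
    after dropping rows where the independent variable is not greater than 0.
    
    :param df: dataframe with a battle_type column plus the hover columns used below
    :param x: name of the independent variable column
    :param y: name of the dependent variable column
    '''    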
    df = df[(df[x] > 0)]
    battle_type_list = df['battle_type'].unique()
    for i in range(len(battle_type_list)):
        fig = px.scatter(df[(df.battle_type == battle_type_list[i])], x=x, y=y, 
                         hover_name="name",
                         hover_data=["name","year","attacker_outcome","attacker_king", "defender_king"],
                         trendline="ols")
        fig.update_layout(title= str(x) + ' vs ' + str(y) + ' - ' + str(battle_type_list[i]))
        fig.show()
In [67]:
scatter_plotter(battles_win_df2,'attacker_size','defender_size')

Ambush: The linear model improves significantly when the data is split by battle type. Here, the coefficient of determination (R-squared) is 0.839, which means that 83.9% of the variance in the dependent variable is explained for ambush battles. This model could be used to estimate the size of the defending army if the attacking house is planning an ambush, although realistically a defending army would not be ready for this type of attack.

Pitched Battle: The linear model also improves when the data is split by battle type. Here, the coefficient of determination (R-squared) is 0.468, which means that only 46.8% of the variance in the dependent variable is explained for pitched battles. The correlation of 0.68 is moderately strong, so this model could be used to estimate the size of the defending army if the attacking house is involved in a pitched battle. No good model could be generated for sieges due to lack of data.

Conclusion

If an army is involved in an ambush or a pitched battle, it can use the linear model to estimate the size of the opposing army, but the ambush model would be more accurate than the pitched battle model.
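
Using such a model amounts to fitting a straight line and reading off a prediction. The sketch below does this with ordinary least squares in plain Python; the army sizes are hypothetical, so the fitted coefficients are not the ones behind the Plotly trendlines.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical ambush battles: attacker sizes and the matching defender sizes
attacker_sizes = [100, 600, 1900, 6000]
defender_sizes = [150, 800, 2500, 7800]

intercept, slope = fit_line(attacker_sizes, defender_sizes)

# Estimated defender size when attacking with 3,000 soldiers
estimate = intercept + slope * 3000
print(round(intercept, 1), round(slope, 3), round(estimate))
```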

In [68]:
![undead](../img/undead.jpg)

Classification Modeling

Now comes the fun part of the post. I will predict the battle outcomes using various classification models:

  • Logistic Regression (Baseline Model)
  • Random Forest
  • XGBoost

There are fewer than 40 records in the battles dataset, so I will have to clean the dataset as much as I can.

In [69]:
# Drop columns that are not used as model features (names, commanders, outcome columns, battle_size)
model_df = battles_df.drop(columns = ['name','location','attacker_commander','defender_commander','attacker_outcome','attacker_outcome_flag','battle_size'])
attacker_outcome = battles_df['attacker_outcome_flag']
In [70]:
model_df.dtypes
Out[70]:
year                          int64
attacker_king                object
defender_king                object
attacker_1                   object
defender_1                   object
battle_type                  object
major_death                 float64
major_capture               float64
attacker_size               float64
defender_size               float64
summer                      float64
region                       object
attack_houses                 int64
defender_houses               int64
attacker_commander_count      int64
defender_commander_count      int64
dtype: object
In [71]:
model_df[model_df.columns[0:10]].head()
Out[71]:
year attacker_king defender_king attacker_1 defender_1 battle_type major_death major_capture attacker_size defender_size
0 299 Balon/Euron Greyjoy Robb Stark Greyjoy Stark ambush 0.0 1.0 20.0 0.0
1 299 Robb Stark Joffrey/Tommen Baratheon Stark Lannister ambush 1.0 0.0 100.0 100.0
2 299 Robb Stark Balon/Euron Greyjoy Stark Greyjoy pitched battle 0.0 0.0 244.0 900.0
3 299 Balon/Euron Greyjoy Robb Stark Greyjoy Stark ambush 0.0 0.0 264.0 0.0
4 299 Joffrey/Tommen Baratheon Robb Stark Bolton Stark ambush 1.0 0.0 618.0 2000.0
In [72]:
model_df[model_df.columns[11:23]].head()
Out[72]:
region attack_houses defender_houses attacker_commander_count defender_commander_count
0 The North 1 1 1 1
1 The Riverlands 1 1 3 1
2 The North 1 1 2 1
3 The North 1 1 1 0
4 The North 2 1 2 3
In [73]:
# Replace slashes and spaces with underscores
model_df[['attacker_king','defender_king','attacker_1','defender_1']] = model_df[['attacker_king','defender_king','attacker_1','defender_1']].replace('/','_',regex=True)
model_df[['attacker_king','defender_king','region','battle_type','attacker_1','defender_1']] = model_df[['attacker_king','defender_king','region','battle_type','attacker_1','defender_1']].replace(' ','_',regex=True)
model_df[model_df.columns[0:10]].head()
Out[73]:
year attacker_king defender_king attacker_1 defender_1 battle_type major_death major_capture attacker_size defender_size
0 299 Balon_Euron_Greyjoy Robb_Stark Greyjoy Stark ambush 0.0 1.0 20.0 0.0
1 299 Robb_Stark Joffrey_Tommen_Baratheon Stark Lannister ambush 1.0 0.0 100.0 100.0
2 299 Robb_Stark Balon_Euron_Greyjoy Stark Greyjoy pitched_battle 0.0 0.0 244.0 900.0
3 299 Balon_Euron_Greyjoy Robb_Stark Greyjoy Stark ambush 0.0 0.0 264.0 0.0
4 299 Joffrey_Tommen_Baratheon Robb_Stark Bolton Stark ambush 1.0 0.0 618.0 2000.0
In [74]:
model_df[model_df.columns[10:24]].head()
Out[74]:
summer region attack_houses defender_houses attacker_commander_count defender_commander_count
0 1.0 The_North 1 1 1 1
1 1.0 The_Riverlands 1 1 3 1
2 1.0 The_North 1 1 2 1
3 1.0 The_North 1 1 1 0
4 1.0 The_North 2 1 2 3
In [75]:
# Get list of categorical variables
categorical_feature_mask = model_df.dtypes==object
categorical_cols = model_df.columns[categorical_feature_mask].tolist()
categorical_cols
Out[75]:
['attacker_king',
 'defender_king',
 'attacker_1',
 'defender_1',
 'battle_type',
 'region']
In [76]:
# Convert categorical variables into binary variables such as Attacker King, Defender King, Attacker 1, Defender 1, battle type, and region
model_df1 = pd.get_dummies(model_df, columns=categorical_cols, prefix = categorical_cols)
model_df1.head()
Out[76]:
year major_death major_capture attacker_size defender_size summer attack_houses defender_houses attacker_commander_count defender_commander_count ... battle_type_pitched_battle battle_type_razing battle_type_siege region_Beyond_the_Wall region_The_Crownlands region_The_North region_The_Reach region_The_Riverlands region_The_Stormlands region_The_Westerlands
0 299 0.0 1.0 20.0 0.0 1.0 1 1 1 1 ... 0 0 0 0 0 1 0 0 0 0
1 299 1.0 0.0 100.0 100.0 1.0 1 1 3 1 ... 0 0 0 0 0 0 0 1 0 0
2 299 0.0 0.0 244.0 900.0 1.0 1 1 2 1 ... 1 0 0 0 0 1 0 0 0 0
3 299 0.0 0.0 264.0 0.0 1.0 1 1 1 0 ... 0 0 0 0 0 1 0 0 0 0
4 299 1.0 0.0 618.0 2000.0 1.0 2 1 2 3 ... 0 0 0 0 0 1 0 0 0 0

5 rows × 54 columns
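
The column count grows to 54 because each categorical column is replaced by one 0/1 indicator column per distinct value. A plain-Python sketch of that encoding for a single column, using a short hypothetical list of values:

```python
# Hypothetical values of one categorical column
battle_types = ["ambush", "pitched_battle", "ambush", "siege"]

# One indicator column per distinct category, named <prefix>_<category>
categories = sorted(set(battle_types))
one_hot = [
    {f"battle_type_{c}": int(value == c) for c in categories}
    for value in battle_types
]
for row in one_hot:
    print(row)
```

Each row has exactly one indicator set to 1 per original categorical column.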

In [77]:
model_df1.columns
Out[77]:
Index(['year', 'major_death', 'major_capture', 'attacker_size',
       'defender_size', 'summer', 'attack_houses', 'defender_houses',
       'attacker_commander_count', 'defender_commander_count',
       'attacker_king_Balon_Euron_Greyjoy',
       'attacker_king_Joffrey_Tommen_Baratheon', 'attacker_king_Mance_Rayder',
       'attacker_king_Robb_Stark', 'attacker_king_Stannis_Baratheon',
       'defender_king_Balon_Euron_Greyjoy',
       'defender_king_Joffrey_Tommen_Baratheon',
       'defender_king_Renly_Baratheon', 'defender_king_Robb_Stark',
       'defender_king_Stannis_Baratheon', 'attacker_1_Baratheon',
       'attacker_1_Bolton', 'attacker_1_Bracken',
       'attacker_1_Brave_Companions', 'attacker_1_Brotherhood_without_Banners',
       'attacker_1_Darry', 'attacker_1_Free_folk', 'attacker_1_Frey',
       'attacker_1_Greyjoy', 'attacker_1_Lannister', 'attacker_1_Stark',
       'defender_1_Baratheon', 'defender_1_Blackwood', 'defender_1_Bolton',
       'defender_1_Brave_Companions', 'defender_1_Darry', 'defender_1_Greyjoy',
       'defender_1_Lannister', 'defender_1_Mallister',
       'defender_1_Night's_Watch', 'defender_1_Stark', 'defender_1_Tully',
       'defender_1_Tyrell', 'battle_type_ambush', 'battle_type_pitched_battle',
       'battle_type_razing', 'battle_type_siege', 'region_Beyond_the_Wall',
       'region_The_Crownlands', 'region_The_North', 'region_The_Reach',
       'region_The_Riverlands', 'region_The_Stormlands',
       'region_The_Westerlands'],
      dtype='object')

Log transform defender_size and attacker_size

Log-transform attacker size and defender size to reduce the skew caused by a few very large armies.
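
A small illustration of the log(x + 1) transform on a handful of army sizes, including 0 and the 100,000-strong outlier. Adding 1 keeps a size of 0 defined, since log(0) is not.

```python
import math

# Army sizes spanning the range in the data: a zero, small raids, and the 100,000 outlier
sizes = [0, 20, 618, 100000]

# Same formula as np.log(x + 1); math.log1p(x) would give the same values
logged = [math.log(x + 1) for x in sizes]
print([round(v, 2) for v in logged])
```

After the transform the values span roughly 0 to 11.5 instead of 0 to 100,000, so the largest army no longer dominates the scale.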

In [78]:
# Log-transform the skewed features
skewed = ['attacker_size', 'defender_size']
features_log_transformed = pd.DataFrame(data = model_df1)
features_log_transformed[skewed] = model_df1[skewed].apply(lambda x: np.log(x + 1))
In [79]:
battles_df['attacker_size'].hist(bins=20)
Out[79]:
<matplotlib.axes._subplots.AxesSubplot at 0x134f46910>
In [80]:
features_log_transformed['attacker_size'].hist(bins=20)
Out[80]:
<matplotlib.axes._subplots.AxesSubplot at 0x135067410>
In [81]:
battles_df['defender_size'].hist(bins=20)
Out[81]:
<matplotlib.axes._subplots.AxesSubplot at 0x1351a8610>
In [82]:
features_log_transformed['defender_size'].hist(bins=20)
Out[82]:
<matplotlib.axes._subplots.AxesSubplot at 0x132f765d0>

Normalizing Numerical Features

Normalize the numerical features to the range 0 to 1: the counts of attacker houses, defender houses, attacker commanders, and defender commanders, plus the log-transformed attacker and defender sizes from the prior step.
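
Min-max scaling maps each value to (x - min) / (max - min). The sketch below applies it to a hypothetical commander-count column; the extremes of 0 and 6 are an assumption chosen because they are consistent with a count of 1 scaling to 0.166667 in the output further down.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant column: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical column; 0 and 6 are the assumed minimum and maximum
commander_counts = [0, 1, 3, 2, 1, 2, 6]
scaled = min_max_scale(commander_counts)
print([round(v, 6) for v in scaled])
```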

In [83]:
# Scale attack_houses, defender_houses, attacker_commander_count, defender_commander_count, attacker_size, defender_size
# Import sklearn.preprocessing.MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Initialize a scaler, then apply it to the features
scaler = MinMaxScaler() # default=(0, 1)
numerical = ['attack_houses','defender_houses','attacker_commander_count','defender_commander_count','attacker_size','defender_size']

features_log_minmax_transform = pd.DataFrame(data = features_log_transformed)
features_log_minmax_transform[numerical] = scaler.fit_transform(features_log_transformed[numerical])

# Show an example of a record with scaling applied
display(features_log_minmax_transform.head(n = 5))
year major_death major_capture attacker_size defender_size summer attack_houses defender_houses attacker_commander_count defender_commander_count ... battle_type_pitched_battle battle_type_razing battle_type_siege region_Beyond_the_Wall region_The_Crownlands region_The_North region_The_Reach region_The_Riverlands region_The_Stormlands region_The_Westerlands
0 299 0.0 1.0 0.264444 0.000000 1.0 0.000000 0.5 0.166667 0.142857 ... 0 0 0 0 0 1 0 0 0 0
1 299 1.0 0.0 0.400864 0.466007 1.0 0.000000 0.5 0.500000 0.142857 ... 0 0 0 0 0 0 0 1 0 0
2 299 0.0 0.0 0.477833 0.686977 1.0 0.000000 0.5 0.333333 0.142857 ... 1 0 0 0 0 1 0 0 0 0
3 299 0.0 0.0 0.484649 0.000000 1.0 0.000000 0.5 0.166667 0.000000 ... 0 0 0 0 0 1 0 0 0 0
4 299 1.0 0.0 0.558338 0.767544 1.0 0.333333 0.5 0.333333 0.428571 ... 0 0 0 0 0 1 0 0 0 0

5 rows × 54 columns

In [84]:
# Export data for modeling
features_final = features_log_minmax_transform  # use the log-transformed and scaled features explicitly
features_final.to_csv('../data/battles_data_model.csv',index = False)

Shuffle and split data

I used 75% of the data for training (28 records) and 25% for testing (10 records) across all three models. I also used 20-fold cross-validation, a resampling procedure for evaluating machine learning models on a limited data sample.
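
The arithmetic behind these numbers shows how small each validation fold is. The sketch assumes a plain, unstratified k-fold split; the stratified splitter that scikit-learn applies to classifiers can distribute records slightly differently, but the folds stay just as tiny.

```python
import math

n_samples = 38
n_test = math.ceil(0.25 * n_samples)   # a fractional test size is rounded up
n_train = n_samples - n_test

n_folds = 20
# Unstratified k-fold: the first (n_train % n_folds) folds get one extra record
fold_sizes = [n_train // n_folds + (1 if i < n_train % n_folds else 0)
              for i in range(n_folds)]
print(n_train, n_test, fold_sizes)
```

With only one or two records per fold, each fold's accuracy can take very few distinct values, which helps explain the large standard deviations reported for the models.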

In [85]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features_final, 
                                                    attacker_outcome, 
                                                    random_state = 0,
                                                    test_size = 0.25)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))
Training set has 28 samples.
Testing set has 10 samples.

Model Selection

  • Logistic Regression

  • Random Forest

  • XGBoost

Logistic Regression

In [86]:
# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 20)

logistic_acc = accuracies.mean()
logistic_std = accuracies.std()

display(logistic_acc)
display(logistic_std)
0.9166666666666666
0.2327373340628157

The accuracy using logistic regression is 91.7% with a standard deviation of 0.2327, so we already have a good baseline classification model.

Random Forest

In [87]:
# Fitting Random Forest Classification to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 20, criterion = 'entropy',random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 20)

rf_acc = accuracies.mean()
rf_std = accuracies.std()

display(rf_acc)
display(rf_std)
0.95
0.11902380714238082
In [88]:
# Extract the feature importances from the fitted random forest
importances = classifier.feature_importances_

# Plot
vs.feature_plot(importances, X_train, y_train)

The accuracy using random forest is 95.0% with a standard deviation of 0.119. Accuracy improved by about 3.3 percentage points over logistic regression (roughly 3.6% in relative terms), and the standard deviation fell by almost half.

XGBoost

In [89]:
# Fitting XGBoost to the Training set
from xgboost import XGBClassifier
classifier = XGBClassifier(random_state=0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 20)

xgb_acc = accuracies.mean()
xgb_std = accuracies.std()

display(xgb_acc)
display(xgb_std)
0.925
0.15343293866268307

Surprisingly, XGBoost performed worse than the Random Forest model (92.5% vs 95.0%), but it is still a good model.

Random Forest has the best accuracy at 95%. The most important features are attacker_size, attacker_commander_count, attack_houses, defender_size, and defender_houses.

In [90]:
model_data = {'Model':  ['Logistic Regression', 'Random Forest', 'XGBoost'],
        'Accuracy': [logistic_acc, rf_acc, xgb_acc],
        'Standard Deviation': [logistic_std,rf_std,xgb_std]
        }

model_df = pd.DataFrame(model_data, columns = ['Model','Accuracy','Standard Deviation'])
model_df['Accuracy'] = model_df['Accuracy'].astype(float)
model_df['Standard Deviation']= model_df['Standard Deviation'].astype(float)

model_df = model_df.sort_values('Accuracy',ascending=False)

display(model_df)
chart = sns.catplot(x="Model", y="Accuracy",kind="bar", data=model_df)
Model Accuracy Standard Deviation
1 Random Forest 0.950000 0.119024
2 XGBoost 0.925000 0.153433
0 Logistic Regression 0.916667 0.232737

Implementation - Extracting Feature Importance

I reduced the dataset to the five most important features and fed it into the Random Forest model again.
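
Selecting the top features is a matter of sorting by importance in descending order and keeping the first five. In the sketch below the feature names come from the dataset, but the importance values are invented for illustration.

```python
# Hypothetical importances keyed by feature name (values are made up)
importances = {
    "attacker_size": 0.30,
    "year": 0.02,
    "defender_size": 0.15,
    "attack_houses": 0.18,
    "summer": 0.01,
    "attacker_commander_count": 0.22,
    "defender_houses": 0.12,
}

top_k = 5
# Sort feature names by importance, highest first, and keep the first top_k
top_features = sorted(importances, key=importances.get, reverse=True)[:top_k]
print(top_features)
```

The notebook does the same thing on arrays with np.argsort(importances)[::-1][:5] and then selects the matching columns from X_train and X_test.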

In [91]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 20, criterion = 'entropy',random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 20)
display(accuracies.mean())
display(accuracies.std())
0.95
0.11902380714238082
In [92]:
# Extract the feature importances from the fitted random forest
importances = classifier.feature_importances_
vs.feature_plot(importances, X_train, y_train)
In [93]:
# Import functionality for cloning a model
from sklearn.base import clone

# Reduce the feature space to the five most important features
top_features = X_train.columns.values[(np.argsort(importances)[::-1])[:5]]
X_train_reduced = X_train[top_features]
X_test_reduced = X_test[top_features]

# Train a fresh copy of the Random Forest on the reduced data
clf = clone(classifier).fit(X_train_reduced, y_train)

# Make new predictions
reduced_predictions = clf.predict(X_test_reduced)

# Making the Confusion Matrix from the reduced model's predictions
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, reduced_predictions)

# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = clf, X = X_train_reduced, y = y_train, cv = 20)

rfr_acc = accuracies.mean()
rfr_std = accuracies.std()

display(rfr_acc)
display(rfr_std)
0.9666666666666666
0.1
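
The cell above imports `clone`. Here is a minimal sketch of what it is for, using synthetic data from `make_classification` rather than the battles data (an assumption made so the example runs on its own). `clone` returns an unfitted copy with the same hyperparameters, so a reduced model can be trained without overwriting the full-feature one.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 60 rows, 8 features (not the battles dataset)
X, y = make_classification(n_samples=60, n_features=8, n_informative=4, random_state=0)

full_model = RandomForestClassifier(n_estimators=20, criterion="entropy", random_state=0)
full_model.fit(X, y)

# Column positions of the five largest importances, best first
top5 = full_model.feature_importances_.argsort()[::-1][:5]

# clone() copies the hyperparameters but not the fitted trees
reduced_model = clone(full_model).fit(X[:, top5], y)

print(full_model.n_features_in_, reduced_model.n_features_in_)
```

Calling `fit` again on the original estimator would instead replace its trees, leaving no full-feature model to compare against.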
In [94]:
importances = clf.feature_importances_
vs.feature_plot(importances, X_train_reduced, y_train)

After reducing the features to the five most significant predictors, accuracy improves slightly from 95% to 96.7% and the standard deviation drops from 0.119 to 0.1. We now have a model that can predict battle outcomes for Game of Thrones!
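
The round standard deviation of 0.1 is what very small folds would produce. The sketch below assumes each of the 20 folds holds three battles (an assumption; the notebook does not show the fold sizes). Under that assumption, one pattern of fold scores that reproduces the reported mean and standard deviation is 18 perfect folds plus two folds with a single miss each.

```python
from statistics import mean, pstdev

# Hypothetical fold scores: 18 folds fully correct, 2 folds with 2 of 3 correct.
# The fold size of three is an assumption, not taken from the notebook.
fold_scores = [1.0] * 18 + [2 / 3] * 2

cv_mean = mean(fold_scores)
cv_std = pstdev(fold_scores)  # population std, the same default as numpy's .std()

print(round(cv_mean, 4), round(cv_std, 4))
```

With folds this small, one misclassified battle shifts a fold's score by a third, so differences of a few percentage points between models rest on one or two battles.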

The most important factors are:

attacker_size - The size of the attacking army matters (unless you are Balon/Euron Greyjoy, who wins with smaller armies).
attacker_commander_count - The number of attacking commanders matters as well.
attack_houses - The number of attacking houses in the battle.
defender_size - The size of the defending army matters (unless you are Balon/Euron Greyjoy, who wins with smaller armies).
defender_houses - The number of defending houses in the battle.
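
The ranking behind this list can be sketched without scikit-learn. The importance values below are invented for illustration, since the notebook's fitted values appear only in the plot, and `top_k_features` is a hypothetical helper. The five feature names come from the list above. The other three are dataset columns used as placeholders, and they may not be in the model's feature set.

```python
def top_k_features(importances, k=5):
    """Return the k feature names with the largest importance, best first."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Illustrative values only; they are not the notebook's fitted importances.
example_importances = {
    "summer": 0.04,
    "attacker_size": 0.30,
    "defender_houses": 0.08,
    "major_death": 0.03,
    "attacker_commander_count": 0.22,
    "defender_size": 0.12,
    "attack_houses": 0.18,
    "major_capture": 0.03,
}

print(top_k_features(example_importances))
```

This mirrors `np.argsort(importances)[::-1][:5]` in the reduction cell, which sorts ascending, reverses the order, and keeps the first five positions.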

In [95]:
model_data = {'Model':  ['Logistic Regression', 'Random Forest', 'XGBoost','Random Forest Reduced'],
        'Accuracy': [logistic_acc, rf_acc, xgb_acc, rfr_acc],
        'Standard Deviation': [logistic_std,rf_std,xgb_std, rfr_std]
        }

model_df = pd.DataFrame(model_data, columns = ['Model','Accuracy','Standard Deviation'])
model_df['Accuracy'] = model_df['Accuracy'].astype(float)
model_df['Standard Deviation']= model_df['Standard Deviation'].astype(float)

model_df = model_df.sort_values('Accuracy',ascending=False)

display(model_df)
chart = sns.catplot(x="Model", y="Accuracy",kind="bar", data=model_df)
display(chart.set_xticklabels(rotation=45, horizontalalignment='right'))
Model Accuracy Standard Deviation
3 Random Forest Reduced 0.966667 0.100000
1 Random Forest 0.950000 0.119024
2 XGBoost 0.925000 0.153433
0 Logistic Regression 0.916667 0.232737
<seaborn.axisgrid.FacetGrid at 0x133825750>

The Random Forest Reduced model improved to 96.7% accuracy and its standard deviation fell to 0.1. It looks like we have a model that can predict battle outcomes!

Recap

1) Joffrey/Tommen Baratheon wins the most battles against defending armies, since the Greyjoys never had to fight a defending army larger than zero.

2) If an army is involved in an ambush or a pitched battle, it can use the linear model to estimate the size of the opposing army, though the ambush model is more accurate than the pitched battle model.

3) A Random Forest model can predict battle outcomes with an accuracy of 96.7%, and the most important factors in determining a victory are attacker size, attacker commander count, number of attacking houses, defender size, and number of defending houses.

![dragon3](../img/dragon3.jpg)